MoneyBall Project - Solutions

Rules of Baseball

You don't need to know much about Baseball to complete this exercise. If you're totally unfamiliar with Baseball, check out this useful explanatory video!


Background

Source: Wikipedia

The 2002 Oakland A's

The Oakland Athletics' 2002 season was the team's 35th in Oakland, California. It was also the 102nd season in franchise history. The Athletics finished first in the American League West with a record of 103-59.

The Athletics' 2002 campaign ranks among the most famous in franchise history. Following the 2001 season, Oakland saw the departure of three key players (the lost boys). Billy Beane, the team's general manager, responded with a series of under-the-radar free agent signings. The new-look Athletics, despite a comparative lack of star power, surprised the baseball world by besting the 2001 team's regular season record. The team is most famous, however, for winning 20 consecutive games between August 13 and September 4, 2002.[1] The Athletics' season was the subject of Michael Lewis' 2003 book Moneyball: The Art of Winning an Unfair Game (as Lewis was given the opportunity to follow the team around throughout that season)

This project is based off the book written by Michael Lewis (later turned into a movie).

Moneyball Book

The central premise of book Moneyball is that the collective wisdom of baseball insiders (including players, managers, coaches, scouts, and the front office) over the past century is subjective and often flawed. Statistics such as stolen bases, runs batted in, and batting average, typically used to gauge players, are relics of a 19th-century view of the game and the statistics available at that time. The book argues that the Oakland A's' front office took advantage of more analytical gauges of player performance to field a team that could better compete against richer competitors in Major League Baseball (MLB).

Rigorous statistical analysis had demonstrated that on-base percentage and slugging percentage are better indicators of offensive success, and the A's became convinced that these qualities were cheaper to obtain on the open market than more historically valued qualities such as speed and contact. These observations often flew in the face of conventional baseball wisdom and the beliefs of many baseball scouts and executives.

By re-evaluating the strategies that produce wins on the field, the 2002 Athletics, with approximately US 44 million dollars in salary, were competitive with larger market teams such as the New York Yankees, who spent over US$125 million in payroll that same season.

Because of the team's smaller revenues, Oakland is forced to find players undervalued by the market, and their system for finding value in undervalued players has proven itself thus far. This approach brought the A's to the playoffs in 2002 and 2003.

In this project we'll work with some data and with the goal of trying to find replacement players for the ones lost at the start of the off-season - During the 2001–02 offseason, the team lost three key free agents to larger market teams: 2000 AL MVP Jason Giambi to the New York Yankees, outfielder Johnny Damon to the Boston Red Sox, and closer Jason Isringhausen to the St. Louis Cardinals.

The main goal of this project is for you to feel comfortable working with R on real data to try and derive actionable insights!

Let's get started!

Follow the steps outlined in bold below using your new R skills and help the Oakland A's recruit under-valued players!

Data

We'll be using data from Sean Lahaman's Website a very useful source for baseball statistics. The documentation for the csv files is located in the readme2013.txt file. You may need to reference this to understand what acronyms stand for.

Use R to open the Batting.csv file and assign it to a dataframe called batting using read.csv

In [1]:
batting <- read.csv('Batting.csv')

Use head() to check out the batting

In [2]:
head(batting)
Out[2]:
playerIDyearIDstintteamIDlgIDGG_battingABRHX2BX3BHRRBISBCSBBSOIBBHBPSHSFGIDPG_old
1aardsda0120041SFNNL1111000000000000000011
2aardsda0120061CHNNL4543200000000000010045
3aardsda0120071CHAAL25200000000000000002
4aardsda0120081BOSAL47510000000001000005
5aardsda0120091SEAAL7330000000000000000NA
6aardsda0120101SEAAL5340000000000000000NA

Use str() to check the structure. Pay close attention to how columns that start with a number get an 'X' in front of them! You'll need to know this to call those columns!

In [3]:
str(batting)
'data.frame':	97889 obs. of  24 variables:
 $ playerID : Factor w/ 18107 levels "aardsda01","aaronha01",..: 1 1 1 1 1 1 1 2 2 2 ...
 $ yearID   : int  2004 2006 2007 2008 2009 2010 2012 1954 1955 1956 ...
 $ stint    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ teamID   : Factor w/ 149 levels "ALT","ANA","ARI",..: 117 35 33 16 116 116 93 80 80 80 ...
 $ lgID     : Factor w/ 6 levels "AA","AL","FL",..: 4 4 2 2 2 2 2 4 4 4 ...
 $ G        : int  11 45 25 47 73 53 1 122 153 153 ...
 $ G_batting: int  11 43 2 5 3 4 NA 122 153 153 ...
 $ AB       : int  0 2 0 1 0 0 NA 468 602 609 ...
 $ R        : int  0 0 0 0 0 0 NA 58 105 106 ...
 $ H        : int  0 0 0 0 0 0 NA 131 189 200 ...
 $ X2B      : int  0 0 0 0 0 0 NA 27 37 34 ...
 $ X3B      : int  0 0 0 0 0 0 NA 6 9 14 ...
 $ HR       : int  0 0 0 0 0 0 NA 13 27 26 ...
 $ RBI      : int  0 0 0 0 0 0 NA 69 106 92 ...
 $ SB       : int  0 0 0 0 0 0 NA 2 3 2 ...
 $ CS       : int  0 0 0 0 0 0 NA 2 1 4 ...
 $ BB       : int  0 0 0 0 0 0 NA 28 49 37 ...
 $ SO       : int  0 0 0 1 0 0 NA 39 61 54 ...
 $ IBB      : int  0 0 0 0 0 0 NA NA 5 6 ...
 $ HBP      : int  0 0 0 0 0 0 NA 3 3 2 ...
 $ SH       : int  0 1 0 0 0 0 NA 6 7 5 ...
 $ SF       : int  0 0 0 0 0 0 NA 4 4 7 ...
 $ GIDP     : int  0 0 0 0 0 0 NA 13 20 21 ...
 $ G_old    : int  11 45 2 5 NA NA NA 122 153 153 ...

Make sure you understand how to call the columns by using the $ symbol.

Call the head() of the first five rows of AB (At Bats) column

In [4]:
head(batting$AB)
Out[4]:
  1. 0
  2. 2
  3. 0
  4. 1
  5. 0
  6. 0

Call the head of the doubles (X2B) column

In [5]:
head(batting$X2B)
Out[5]:
  1. 0
  2. 0
  3. 0
  4. 0
  5. 0
  6. 0

Quick Note: If you used fread() to use data.table, then you won't need to worry about these X in front of numbers, instead you would use something like:

batting[,'2B',with=FALSE]

There's a few more ways of doing detailed here.

Alright! Let's move on!

Feature Engineering

We need to add three more statistics that were used in Moneyball! These are:

Click on the links provided and search the wikipedia page for the formula for creating the new statistic! For example, for Batting Average, you'll need to scroll down until you see:

$$ AVG = \frac{H}{AB} $$

Which means that the Batting Average is equal to H (Hits) divided by AB (At Base). So we'll do the following to create a new column called BA and add it to our data frame:

In [6]:
batting$BA <- batting$H / batting$AB

After doing this operation, check the last 5 entries of the BA column of your data frame and it should look like this:

In [7]:
tail(batting$BA,5)
Out[7]:
  1. 0.123076923076923
  2. 0.274647887323944
  3. 0.147058823529412
  4. 0.274509803921569
  5. 0.213872832369942

Now do the same for some new columns! On Base Percentage (OBP) and Slugging Percentage (SLG). Hint: For SLG, you need 1B (Singles), this isn't in your data frame. However you can calculate it by subtracting doubles,triples, and home runs from total hits (H): 1B = H-2B-3B-HR

  • Create an OBP Column
  • Create an SLG Column
In [8]:
# On Base Percentage
batting$OBP <- (batting$H + batting$BB + batting$HBP)/(batting$AB + batting$BB + batting$HBP + batting$SF)
In [9]:
# Creating X1B (Singles)
batting$X1B <- batting$H - batting$X2B - batting$X3B - batting$HR
In [10]:
# Creating Slugging Average (SLG)
batting$SLG <- ((1 * batting$X1B) + (2 * batting$X2B) + (3 * batting$X3B) + (4 * batting$HR) ) / batting$AB

Check the structure of your data frame using str()

In [11]:
str(batting)
'data.frame':	97889 obs. of  28 variables:
 $ playerID : Factor w/ 18107 levels "aardsda01","aaronha01",..: 1 1 1 1 1 1 1 2 2 2 ...
 $ yearID   : int  2004 2006 2007 2008 2009 2010 2012 1954 1955 1956 ...
 $ stint    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ teamID   : Factor w/ 149 levels "ALT","ANA","ARI",..: 117 35 33 16 116 116 93 80 80 80 ...
 $ lgID     : Factor w/ 6 levels "AA","AL","FL",..: 4 4 2 2 2 2 2 4 4 4 ...
 $ G        : int  11 45 25 47 73 53 1 122 153 153 ...
 $ G_batting: int  11 43 2 5 3 4 NA 122 153 153 ...
 $ AB       : int  0 2 0 1 0 0 NA 468 602 609 ...
 $ R        : int  0 0 0 0 0 0 NA 58 105 106 ...
 $ H        : int  0 0 0 0 0 0 NA 131 189 200 ...
 $ X2B      : int  0 0 0 0 0 0 NA 27 37 34 ...
 $ X3B      : int  0 0 0 0 0 0 NA 6 9 14 ...
 $ HR       : int  0 0 0 0 0 0 NA 13 27 26 ...
 $ RBI      : int  0 0 0 0 0 0 NA 69 106 92 ...
 $ SB       : int  0 0 0 0 0 0 NA 2 3 2 ...
 $ CS       : int  0 0 0 0 0 0 NA 2 1 4 ...
 $ BB       : int  0 0 0 0 0 0 NA 28 49 37 ...
 $ SO       : int  0 0 0 1 0 0 NA 39 61 54 ...
 $ IBB      : int  0 0 0 0 0 0 NA NA 5 6 ...
 $ HBP      : int  0 0 0 0 0 0 NA 3 3 2 ...
 $ SH       : int  0 1 0 0 0 0 NA 6 7 5 ...
 $ SF       : int  0 0 0 0 0 0 NA 4 4 7 ...
 $ GIDP     : int  0 0 0 0 0 0 NA 13 20 21 ...
 $ G_old    : int  11 45 2 5 NA NA NA 122 153 153 ...
 $ BA       : num  NaN 0 NaN 0 NaN ...
 $ OBP      : num  NaN 0 NaN 0 NaN ...
 $ X1B      : int  0 0 0 0 0 0 NA 85 116 126 ...
 $ SLG      : num  NaN 0 NaN 0 NaN ...

Merging Salary Data with Batting Data

We know we don't just want the best players, we want the most undervalued players, meaning we will also need to know current salary information! We have salary information in the csv file 'Salaries.csv'.

Complete the following steps to merge the salary data with the player stats!

Load the Salaries.csv file into a dataframe called sal using read.csv

In [12]:
sal <- read.csv('Salaries.csv')

Use summary to get a summary of the batting data frame and notice the minimum year in the yearID column. Our batting data goes back to 1871! Our salary data starts at 1985, meaning we need to remove the batting data that occured before 1985.

Use subset() to reassign batting to only contain data from 1985 and onwards

In [13]:
summary(batting)
Out[13]:
      playerID         yearID         stint           teamID        lgID      
 mcguide01:   31   Min.   :1871   Min.   :1.000   CHN    : 4720   AA  : 1890  
 henderi01:   29   1st Qu.:1931   1st Qu.:1.000   PHI    : 4621   AL  :44369  
 newsobo01:   29   Median :1970   Median :1.000   PIT    : 4575   FL  :  470  
 johnto01 :   28   Mean   :1962   Mean   :1.077   SLN    : 4535   NL  :49944  
 kaatji01 :   28   3rd Qu.:1995   3rd Qu.:1.000   CIN    : 4393   PL  :  147  
 ansonca01:   27   Max.   :2013   Max.   :5.000   CLE    : 4318   UA  :  332  
 (Other)  :97717                                  (Other):70727   NA's:  737  
       G            G_batting            AB              R         
 Min.   :  1.00   Min.   :  0.00   Min.   :  0.0   Min.   :  0.00  
 1st Qu.: 13.00   1st Qu.:  7.00   1st Qu.:  9.0   1st Qu.:  0.00  
 Median : 35.00   Median : 32.00   Median : 61.0   Median :  5.00  
 Mean   : 51.65   Mean   : 49.13   Mean   :154.1   Mean   : 20.47  
 3rd Qu.: 81.00   3rd Qu.: 81.00   3rd Qu.:260.0   3rd Qu.: 31.00  
 Max.   :165.00   Max.   :165.00   Max.   :716.0   Max.   :192.00  
                  NA's   :1406     NA's   :6413    NA's   :6413    
       H               X2B            X3B               HR        
 Min.   :  0.00   Min.   : 0.0   Min.   : 0.000   Min.   : 0.000  
 1st Qu.:  1.00   1st Qu.: 0.0   1st Qu.: 0.000   1st Qu.: 0.000  
 Median : 12.00   Median : 2.0   Median : 0.000   Median : 0.000  
 Mean   : 40.37   Mean   : 6.8   Mean   : 1.424   Mean   : 3.002  
 3rd Qu.: 66.00   3rd Qu.:10.0   3rd Qu.: 2.000   3rd Qu.: 3.000  
 Max.   :262.00   Max.   :67.0   Max.   :36.000   Max.   :73.000  
 NA's   :6413     NA's   :6413   NA's   :6413     NA's   :6413    
      RBI               SB                CS               BB        
 Min.   :  0.00   Min.   :  0.000   Min.   : 0.000   Min.   :  0.00  
 1st Qu.:  0.00   1st Qu.:  0.000   1st Qu.: 0.000   1st Qu.:  0.00  
 Median :  5.00   Median :  0.000   Median : 0.000   Median :  4.00  
 Mean   : 18.47   Mean   :  3.265   Mean   : 1.385   Mean   : 14.21  
 3rd Qu.: 28.00   3rd Qu.:  2.000   3rd Qu.: 1.000   3rd Qu.: 21.00  
 Max.   :191.00   Max.   :138.000   Max.   :42.000   Max.   :232.00  
 NA's   :6837     NA's   :7713      NA's   :29867    NA's   :6413    
       SO              IBB              HBP               SH        
 Min.   :  0.00   Min.   :  0.00   Min.   : 0.000   Min.   : 0.000  
 1st Qu.:  2.00   1st Qu.:  0.00   1st Qu.: 0.000   1st Qu.: 0.000  
 Median : 11.00   Median :  0.00   Median : 0.000   Median : 1.000  
 Mean   : 21.95   Mean   :  1.28   Mean   : 1.136   Mean   : 2.564  
 3rd Qu.: 31.00   3rd Qu.:  1.00   3rd Qu.: 1.000   3rd Qu.: 3.000  
 Max.   :223.00   Max.   :120.00   Max.   :51.000   Max.   :67.000  
 NA's   :14251    NA's   :42977    NA's   :9233     NA's   :12751   
       SF             GIDP           G_old              BA       
 Min.   : 0.0    Min.   : 0.00   Min.   :  0.00   Min.   :0.000  
 1st Qu.: 0.0    1st Qu.: 0.00   1st Qu.: 11.00   1st Qu.:0.148  
 Median : 0.0    Median : 1.00   Median : 34.00   Median :0.231  
 Mean   : 1.2    Mean   : 3.33   Mean   : 50.99   Mean   :0.209  
 3rd Qu.: 2.0    3rd Qu.: 5.00   3rd Qu.: 82.00   3rd Qu.:0.275  
 Max.   :19.0    Max.   :36.00   Max.   :165.00   Max.   :1.000  
 NA's   :42446   NA's   :32521   NA's   :5189     NA's   :13520  
      OBP             X1B              SLG       
 Min.   :0.00    Min.   :  0.00   Min.   :0.000  
 1st Qu.:0.19    1st Qu.:  1.00   1st Qu.:0.179  
 Median :0.29    Median :  9.00   Median :0.309  
 Mean   :0.26    Mean   : 29.14   Mean   :0.291  
 3rd Qu.:0.34    3rd Qu.: 48.00   3rd Qu.:0.397  
 Max.   :1.00    Max.   :225.00   Max.   :4.000  
 NA's   :49115   NA's   :6413     NA's   :13520  
In [14]:
batting <- subset(batting,yearID >= 1985)

Now use summary again to make sure the subset reassignment worked, your yearID min should be 1985

In [15]:
summary(batting)
Out[15]:
      playerID         yearID         stint          teamID      lgID      
 moyerja01:   27   Min.   :1985   Min.   :1.00   SDN    : 1313   AA:    0  
 mulhote01:   26   1st Qu.:1993   1st Qu.:1.00   CLE    : 1306   AL:17226  
 weathda01:   26   Median :2000   Median :1.00   PIT    : 1299   FL:    0  
 maddugr01:   25   Mean   :2000   Mean   :1.08   NYN    : 1297   NL:18426  
 sierrru01:   25   3rd Qu.:2007   3rd Qu.:1.00   BOS    : 1279   PL:    0  
 thomeji01:   25   Max.   :2013   Max.   :4.00   CIN    : 1279   UA:    0  
 (Other)  :35498                                 (Other):27879             
       G           G_batting            AB              R         
 Min.   :  1.0   Min.   :  0.00   Min.   :  0.0   Min.   :  0.00  
 1st Qu.: 14.0   1st Qu.:  4.00   1st Qu.:  3.0   1st Qu.:  0.00  
 Median : 34.0   Median : 27.00   Median : 47.0   Median :  4.00  
 Mean   : 51.7   Mean   : 46.28   Mean   :144.7   Mean   : 19.44  
 3rd Qu.: 77.0   3rd Qu.: 77.00   3rd Qu.:241.0   3rd Qu.: 30.00  
 Max.   :163.0   Max.   :163.00   Max.   :716.0   Max.   :152.00  
                 NA's   :1406     NA's   :4377    NA's   :4377    
       H               X2B              X3B               HR        
 Min.   :  0.00   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
 1st Qu.:  0.00   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000  
 Median :  8.00   Median : 1.000   Median : 0.000   Median : 0.000  
 Mean   : 37.95   Mean   : 7.293   Mean   : 0.824   Mean   : 4.169  
 3rd Qu.: 61.00   3rd Qu.:11.000   3rd Qu.: 1.000   3rd Qu.: 5.000  
 Max.   :262.00   Max.   :59.000   Max.   :23.000   Max.   :73.000  
 NA's   :4377     NA's   :4377     NA's   :4377     NA's   :4377    
      RBI               SB                CS               BB        
 Min.   :  0.00   Min.   :  0.000   Min.   : 0.000   Min.   :  0.00  
 1st Qu.:  0.00   1st Qu.:  0.000   1st Qu.: 0.000   1st Qu.:  0.00  
 Median :  3.00   Median :  0.000   Median : 0.000   Median :  3.00  
 Mean   : 18.41   Mean   :  2.811   Mean   : 1.219   Mean   : 14.06  
 3rd Qu.: 27.00   3rd Qu.:  2.000   3rd Qu.: 1.000   3rd Qu.: 21.00  
 Max.   :165.00   Max.   :110.000   Max.   :29.000   Max.   :232.00  
 NA's   :4377     NA's   :4377      NA's   :4377     NA's   :4377    
       SO              IBB               HBP               SH        
 Min.   :  0.00   Min.   :  0.000   Min.   : 0.000   Min.   : 0.000  
 1st Qu.:  1.00   1st Qu.:  0.000   1st Qu.: 0.000   1st Qu.: 0.000  
 Median : 12.00   Median :  0.000   Median : 0.000   Median : 0.000  
 Mean   : 27.03   Mean   :  1.171   Mean   : 1.273   Mean   : 1.465  
 3rd Qu.: 42.00   3rd Qu.:  1.000   3rd Qu.: 1.000   3rd Qu.: 2.000  
 Max.   :223.00   Max.   :120.000   Max.   :35.000   Max.   :39.000  
 NA's   :4377     NA's   :4378      NA's   :4387     NA's   :4377    
       SF              GIDP           G_old             BA       
 Min.   : 0.000   Min.   : 0.00   Min.   :  0.0   Min.   :0.000  
 1st Qu.: 0.000   1st Qu.: 0.00   1st Qu.: 11.0   1st Qu.:0.136  
 Median : 0.000   Median : 1.00   Median : 32.0   Median :0.233  
 Mean   : 1.212   Mean   : 3.25   Mean   : 49.7   Mean   :0.205  
 3rd Qu.: 2.000   3rd Qu.: 5.00   3rd Qu.: 77.0   3rd Qu.:0.274  
 Max.   :17.000   Max.   :35.00   Max.   :163.0   Max.   :1.000  
 NA's   :4378     NA's   :4377    NA's   :5189    NA's   :8905   
      OBP             X1B              SLG       
 Min.   :0.000   Min.   :  0.00   Min.   :0.000  
 1st Qu.:0.188   1st Qu.:  0.00   1st Qu.:0.167  
 Median :0.296   Median :  6.00   Median :0.333  
 Mean   :0.262   Mean   : 25.66   Mean   :0.304  
 3rd Qu.:0.342   3rd Qu.: 42.00   3rd Qu.:0.423  
 Max.   :1.000   Max.   :225.00   Max.   :4.000  
 NA's   :8821    NA's   :4377     NA's   :8905   

Now it is time to merge the batting data with the salary data! Since we have players playing multiple years, we'll have repetitions of playerIDs for multiple years, meaning we want to merge on both players and years.

Use the merge() function to merge the batting and sal data frames by c('playerID','yearID'). Call the new data frame combo

In [16]:
combo <- merge(batting,sal,by=c('playerID','yearID'))

Use summary to check the data

In [17]:
summary(combo)
Out[17]:
      playerID         yearID         stint          teamID.x     lgID.x    
 moyerja01:   27   Min.   :1985   Min.   :1.000   LAN    :  940   AA:    0  
 thomeji01:   25   1st Qu.:1993   1st Qu.:1.000   PHI    :  937   AL:12292  
 weathda01:   25   Median :1999   Median :1.000   BOS    :  935   FL:    0  
 vizquom01:   24   Mean   :1999   Mean   :1.098   NYA    :  928   NL:13105  
 gaettga01:   23   3rd Qu.:2006   3rd Qu.:1.000   CLE    :  920   PL:    0  
 griffke02:   23   Max.   :2013   Max.   :4.000   SDN    :  914   UA:    0  
 (Other)  :25250                                  (Other):19823             
       G            G_batting            AB              R         
 Min.   :  1.00   Min.   :  0.00   Min.   :  0.0   Min.   :  0.00  
 1st Qu.: 26.00   1st Qu.:  8.00   1st Qu.:  5.0   1st Qu.:  0.00  
 Median : 50.00   Median : 42.00   Median : 85.0   Median :  9.00  
 Mean   : 64.06   Mean   : 57.58   Mean   :182.4   Mean   : 24.71  
 3rd Qu.:101.00   3rd Qu.:101.00   3rd Qu.:336.0   3rd Qu.: 43.00  
 Max.   :163.00   Max.   :163.00   Max.   :716.0   Max.   :152.00  
                  NA's   :906      NA's   :2661    NA's   :2661    
       H               X2B              X3B               HR        
 Min.   :  0.00   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
 1st Qu.:  1.00   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000  
 Median : 19.00   Median : 3.000   Median : 0.000   Median : 1.000  
 Mean   : 48.18   Mean   : 9.276   Mean   : 1.033   Mean   : 5.369  
 3rd Qu.: 87.25   3rd Qu.:16.000   3rd Qu.: 1.000   3rd Qu.: 7.000  
 Max.   :262.00   Max.   :59.000   Max.   :23.000   Max.   :73.000  
 NA's   :2661     NA's   :2661     NA's   :2661     NA's   :2661    
      RBI               SB                CS              BB        
 Min.   :  0.00   Min.   :  0.000   Min.   : 0.00   Min.   :  0.00  
 1st Qu.:  0.00   1st Qu.:  0.000   1st Qu.: 0.00   1st Qu.:  0.00  
 Median :  8.00   Median :  0.000   Median : 0.00   Median :  6.00  
 Mean   : 23.56   Mean   :  3.568   Mean   : 1.54   Mean   : 17.98  
 3rd Qu.: 39.00   3rd Qu.:  3.000   3rd Qu.: 2.00   3rd Qu.: 29.00  
 Max.   :165.00   Max.   :110.000   Max.   :29.00   Max.   :232.00  
 NA's   :2661     NA's   :2661      NA's   :2661    NA's   :2661    
       SO              IBB               HBP               SH        
 Min.   :  0.00   Min.   :  0.000   Min.   : 0.000   Min.   : 0.000  
 1st Qu.:  2.00   1st Qu.:  0.000   1st Qu.: 0.000   1st Qu.: 0.000  
 Median : 20.00   Median :  0.000   Median : 0.000   Median : 0.000  
 Mean   : 33.52   Mean   :  1.533   Mean   : 1.614   Mean   : 1.786  
 3rd Qu.: 55.00   3rd Qu.:  2.000   3rd Qu.: 2.000   3rd Qu.: 2.000  
 Max.   :223.00   Max.   :120.000   Max.   :35.000   Max.   :39.000  
 NA's   :2661     NA's   :2662      NA's   :2670     NA's   :2661    
       SF              GIDP            G_old              BA       
 Min.   : 0.000   Min.   : 0.000   Min.   :  0.00   Min.   :0.000  
 1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 20.00   1st Qu.:0.160  
 Median : 0.000   Median : 2.000   Median : 47.00   Median :0.242  
 Mean   : 1.554   Mean   : 4.127   Mean   : 61.43   Mean   :0.212  
 3rd Qu.: 2.000   3rd Qu.: 7.000   3rd Qu.:101.00   3rd Qu.:0.276  
 Max.   :17.000   Max.   :35.000   Max.   :163.00   Max.   :1.000  
 NA's   :2662     NA's   :2661     NA's   :3414     NA's   :5618   
      OBP             X1B             SLG           teamID.y     lgID.y    
 Min.   :0.000   Min.   :  0.0   Min.   :0.000   CLE    :  935   AL:12304  
 1st Qu.:0.208   1st Qu.:  0.0   1st Qu.:0.200   PIT    :  932   NL:13093  
 Median :0.305   Median : 13.0   Median :0.351   PHI    :  931             
 Mean   :0.270   Mean   : 32.5   Mean   :0.317   SDN    :  923             
 3rd Qu.:0.346   3rd Qu.: 59.0   3rd Qu.:0.432   LAN    :  921             
 Max.   :1.000   Max.   :225.0   Max.   :4.000   CIN    :  912             
 NA's   :5562    NA's   :2661    NA's   :5618    (Other):19843             
     salary        
 Min.   :       0  
 1st Qu.:  255000  
 Median :  550000  
 Mean   : 1879256  
 3rd Qu.: 2150000  
 Max.   :33000000  
                   

Analyzing the Lost Players

As previously mentioned, the Oakland A's lost 3 key players during the off-season. We'll want to get their stats to see what we have to replace. The players lost were: first baseman 2000 AL MVP Jason Giambi (giambja01) to the New York Yankees, outfielder Johnny Damon (damonjo01) to the Boston Red Sox and infielder Rainer Gustavo "Ray" Olmedo ('saenzol01').

Use the subset() function to get a data frame called lost_players from the combo data frame consisting of those 3 players. Hint: Try to figure out how to use %in% to avoid a bunch of or statements!

In [18]:
lost_players <- subset(combo,playerID %in% c('giambja01','damonjo01','saenzol01') )
In [19]:
lost_players
Out[19]:
playerIDyearIDstintteamID.xlgID.xGG_battingABRHX2BX3BHRRBISBCSBBSOIBBHBPSHSFGIDPG_oldBAOBPX1BSLGteamID.ylgID.ysalary
5135damonjo0119951KCAAL4747188325311532370122201232470.28191490.3235294340.4414894KCAAL109000
5136damonjo0119961KCAAL1451455176114022565025531643310541450.2707930.31294961070.3675048KCAAL180000
5137damonjo0119971KCAAL1461464727013012884816104270236131460.27542370.33783781020.3855932KCAAL240000
5138damonjo0119981KCAAL1611616421041783010186626125884443341610.27725860.33946251200.4392523KCAAL460000
5139damonjo0119991KCAAL145145583101179399147736667505334131450.30703260.37899541170.4768439KCAAL2100000
5140damonjo0120001KCAAL1591596551362144210168846965604181271590.32671760.38199181460.4946565KCAAL4000000
5141damonjo0120011OAKAL15515564410816534494927126170155471550.25621120.32352941180.363354OAKAL7100000
5142damonjo0120021BOSAL154154623118178341114633166570563541540.28571430.35622321190.4430177BOSAL7250000
5143damonjo0120031BOSAL14514560810316632612673066874426651450.27302630.34502921160.4046053BOSAL7500000
5144damonjo0120041BOSAL15015062112318935620941987671120381500.30434780.38034191280.4766506BOSAL8000000
5145damonjo0120051BOSAL14814862411719735610751815369320951480.31570510.36627911460.4391026BOSAL8250000
5146damonjo0120061NYAAL149149593115169355248025106785142541490.28499160.35874441050.4822934NYAAL13000000
5147damonjo0120071NYAAL1411415339314427212632736679121341410.27016890.35099341030.3958724NYAAL13000000
5148damonjo0120081NYAAL1431435559516827517712986482012151430.30270270.37520131190.4612613NYAAL13000000
5149damonjo0120091NYAAL1431435501071553632482120719812219NA0.28181820.3653846920.4890909NYAAL13000000
5150damonjo0120101DETAL14514553981146365851111699022215NA0.2708720.3551555970.4007421DETAL8000000
5151damonjo0120111TBAAL1501505827915229716731965192172541500.26116840.32558141000.4175258TBAAL5250000
7872giambja0119951OAKAL545417627457062521283103124540.25568180.3636364320.3977273OAKAL109000
7873giambja0119961OAKAL1401405368415640120790151953515151400.29104480.3551089950.4813433OAKAL120000
7874giambja0119971OAKAL1421425196615241220810155893608111420.29287090.3622449890.495183OAKAL205000
7875giambja0119981OAKAL153153562921662802711022811027509161530.29537370.38356161110.4893238OAKAL315000
7876giambja0119991OAKAL15815857511518136133123111051066708111580.31478260.42158271110.5530435OAKAL2103333
7877giambja0120001OAKAL152152510108170291431372013796690891520.33333330.4759036970.6470588OAKAL3103333
7878giambja0120011OAKAL154154520109178472381202012983241309171540.34230770.4769001910.6596154OAKAL4103333
7879giambja0120021NYAAL155155560120176341411222210911241505181550.31428570.43541361000.5982143NYAAL10428571
7880giambja0120031NYAAL1561565359713425041107211291409210591560.25046730.4115942680.5271028NYAAL11428571
7881giambja0120041NYAAL8080264335590124001476218035800.20833330.3416149340.3787879NYAAL12428571
7882giambja0120051NYAAL139139417741131403287001081095190171390.27098320.440367670.5347722NYAAL13428571
7883giambja0120061NYAAL139139446921132503711320110106121607101390.25336320.4127807510.558296NYAAL20428571
7884giambja0120071NYAAL8383254316080143910406628011830.23622050.3564356380.4330709NYAAL23428571
7885giambja0120081NYAAL14514545868113191329621761115220961450.24672490.3734513610.5021834NYAAL23428571
7886giambja0120091OAKAL83832693952130114000507217026NA0.19330860.3323171280.3643123OAKAL4000000
7887giambja0120092COLNL1919244710211007800000NA0.29166670.451612940.5833333OAKAL4000000
7888giambja0120101COLNL878717617439063520354756055NA0.24431820.3783784280.3977273COLNL1750000
7889giambja0120111COLNL6464131203460133200174503011640.2595420.3552632150.6030534COLNL1000000
7890giambja0120121COLNL60NA89720401800202422024NA0.22471910.3716814150.3033708COLNL1000000
7891giambja0120131CLEAL717118621348093101235604038NA0.18279570.2824074170.3709677CLEAL750000
20112saenzol0119991OAKAL979725541701801141112247115036970.27450980.3627119410.4745098OAKAL240000
20113saenzol0120001OAKAL7676214406712293310254027016760.31308410.4008097440.5140187OAKAL260000
20114saenzol0120011OAKAL10610630533672119320119641131391060.21967210.2911765360.3836066OAKAL290000
20115saenzol0120021OAKAL6868156154310161811133117022680.2756410.3539326260.4679487OAKAL800000
20116saenzol0120051LANNL109109319398424015630127631302121090.26332290.3247863450.4796238LANNL650000
20117saenzol0120061LANNL10310317930531501148001447170441030.29608940.3627451270.5642458LANNL1000000
20118saenzol0120071LANNL92921109215041800162502045920.19090910.2954545120.3454545LANNL1000000

Since all these players were lost in after 2001 in the offseason, let's only concern ourselves with the data from 2001.

Use subset again to only grab the rows where the yearID was 2001.

In [20]:
lost_players <- subset(lost_players,yearID == 2001)

Reduce the lost_players data frame to the following columns: playerID,H,X2B,X3B,HR,OBP,SLG,BA,AB

In [21]:
lost_players <- lost_players[,c('playerID','H','X2B','X3B','HR','OBP','SLG','BA','AB')]
In [22]:
head(lost_players)
Out[22]:
playerIDHX2BX3BHROBPSLGBAAB
5141damonjo0116534490.32352940.3633540.2562112644
7878giambja01178472380.47690010.65961540.3423077520
20114saenzol016721190.29117650.38360660.2196721305

Replacement Players

Now we have all the information we need! Here is your final task - Find Replacement Players for the key three players we lost! However, you have three constraints:

  • The total combined salary of the three players can not exceed 15 million dollars.
  • Their combined number of At Bats (AB) needs to be equal to or greater than the lost players.
  • Their mean OBP had to equal to or greater than the mean OBP of the lost players

Use the combo dataframe you previously created as the source of information! Remember to just use the 2001 subset of that dataframe. There's lost of different ways you can do this, so be creative! It should be relatively simple to find 3 players that satisfy the requirements, note that there are many correct combinations available!

Helpful info on sorting data frames


Example Solution

Note: There are lots of correct answers and ways to solve this!

First only grab available players from year 2001

In [38]:
library(dplyr)
avail.players <- filter(combo,yearID==2001)

Then I made a quick plot to see where I should cut-off for salary in respect to OBP:

In [39]:
library(ggplot2)
ggplot(avail.players,aes(x=OBP,y=salary)) + geom_point()
Warning message:
: Removed 168 rows containing missing values (geom_point).

Looks like there is no point in paying above 8 million or so (I'm just eyeballing this number). I'll choose that as a cutt off point. There are also a lot of players with OBP==0. Let's get rid of them too.

In [41]:
avail.players <- filter(avail.players,salary<8000000,OBP>0)

The total AB of the lost players is 1469. This is about 1500, meaning I should probably cut off my avail.players at 1500/3= 500 AB.

In [42]:
avail.players <- filter(avail.players,AB >= 500)

Now let's sort by OBP and see what we've got!

In [44]:
possible <- head(arrange(avail.players,desc(OBP)),10)

Grab columns I'm interested in:

In [45]:
possible <- possible[,c('playerID','OBP','AB','salary')]
In [46]:
possible
Out[46]:
playerIDOBPABsalary
1giambja010.47690015204103333
2heltoto010.43165475874950000
3berkmla010.4302326577305000
4gonzalu010.42857146094833333
5thomeji010.41614915267875000
6alomaro010.41467075757750000
7edmonji010.41021425006333333
8gilesbr020.40356085767333333
9pujolal010.402963590200000
10olerujo010.40117995726700000

Can't choose giambja again, but the other ones look good (2-4). I choose them!

In [47]:
possible[2:4,]
Out[47]:
playerIDOBPABsalary
2heltoto010.43165475874950000
3berkmla010.4302326577305000
4gonzalu010.42857146094833333

Great, looks like I just saved the 2001 Oakland A's a lot of money! If only I had a time machine and R, I could have made a lot of money in 2001 picking players!

Great Job!

Congratulations on your first project!